Exploratory Data Analysis (cont’d): - Pairs plot to show correlation between variables and avoid multicollinearity (see 8.2 Many predictors in a model) Logistic Regression seen as an evolution of techniques - Linear Model to show simplest multivariate regression, but predictions can be outside the binary values. - Generalized Linear Model uses a logit transformation to constrain the outputs to being within two values. - Generalized Additive Model allows for “wiggle” in predictor terms. - Maxent (Maximum Entropy) is a presence-only modeling technique that allows for a more complex set of shapes between predictor and response.
librarian::shelf(
DT, dplyr, dismo, GGally, here, readr, tidyr)
select <- dplyr::select # overwrite raster::select
options(readr.show_col_types = F)
dir_data <- here("data")
pts_env_csv <- file.path(dir_data, "pts_env.csv")
pts_env <- read_csv(pts_env_csv)
nrow(pts_env)
## [1] 1552
datatable(pts_env, rownames = F)
GGally to look at pair plots and examine correlations between environmental variablesGGally::ggpairs(
select(pts_env, -ID),
aes(color = factor(present)))
# Logistic Regression ## Setup Data - drop rows of data with any NA values (later, we’ll learn how to impute values) - remove terms we don’t want to model
d <- pts_env %>%
select(-ID) %>% # remove terms we don't want to model
tidyr::drop_na() # drop the rows with NA values
nrow(d)
## [1] 1548
mod <- lm(present ~ ., data = d)
summary(mod)
##
## Call:
## lm(formula = present ~ ., data = d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.16297 -0.30808 0.06136 0.28021 1.09967
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.606e+00 9.115e-01 -2.859 0.004307 **
## WC_alt 9.884e-05 1.015e-04 0.974 0.330322
## WC_bio1 1.975e-01 1.899e-02 10.397 < 2e-16 ***
## WC_bio4 9.127e-04 2.103e-03 0.434 0.664422
## WC_bio12 -1.742e-03 2.291e-04 -7.606 4.89e-14 ***
## ER_climaticMoistureIndex 1.776e+00 2.233e-01 7.954 3.47e-15 ***
## ER_tri 4.954e-03 1.340e-03 3.696 0.000227 ***
## ER_topoWet 3.068e-02 1.607e-02 1.909 0.056411 .
## lon 2.544e-02 2.550e-03 9.975 < 2e-16 ***
## lat 1.128e-01 1.499e-02 7.519 9.31e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4017 on 1538 degrees of freedom
## Multiple R-squared: 0.3588, Adjusted R-squared: 0.3551
## F-statistic: 95.63 on 9 and 1538 DF, p-value: < 2.2e-16
Note: a linear model is ineffective because it predicts values outside the 0 - 1 range
y_predict <- predict(mod, pts_env, type="response")
y_true <- pts_env$present
range(y_predict)
## [1] NA NA
range(y_true)
## [1] 0 1
To solve this problem, we will apply a Logit Transformation
# fit a generalized linear model with a binomial logit link function
mod <- glm(present ~ ., family = binomial(link="logit"), data = d)
summary(mod)
##
## Call:
## glm(formula = present ~ ., family = binomial(link = "logit"),
## data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6979 -0.7890 -0.1476 0.7383 2.6407
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.677e+01 6.373e+00 -2.631 0.008516 **
## WC_alt 1.781e-04 6.754e-04 0.264 0.792060
## WC_bio1 1.114e+00 1.318e-01 8.453 < 2e-16 ***
## WC_bio4 2.136e-02 1.337e-02 1.597 0.110332
## WC_bio12 -1.051e-02 1.486e-03 -7.075 1.49e-12 ***
## ER_climaticMoistureIndex 1.150e+01 1.503e+00 7.653 1.97e-14 ***
## ER_tri 3.434e-02 8.899e-03 3.859 0.000114 ***
## ER_topoWet 2.057e-01 1.008e-01 2.040 0.041324 *
## lon 1.253e-01 1.628e-02 7.696 1.40e-14 ***
## lat 5.508e-01 1.118e-01 4.927 8.35e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2146.0 on 1547 degrees of freedom
## Residual deviance: 1493.1 on 1538 degrees of freedom
## AIC: 1513.1
##
## Number of Fisher Scoring iterations: 5
Note: we are not within the 0 - 1 range we want to be in
y_predict <- predict(mod, d, type="response")
range(y_predict)
## [1] 0.009644674 0.973727980
termplot(mod, partial.resid = TRUE, se = TRUE, main = F)
We can further improve the GLM by adding “wiggle” to the relationship between the predictor and response variables
librarian::shelf(mgcv)
# fit a generalized additive model with smooth predictors
mod <- mgcv::gam(
formula = present ~ s(WC_alt) +
s(WC_bio1) +
s(WC_bio4) +
s(WC_bio12) +
s(ER_climaticMoistureIndex) +
s(ER_tri) +
s(ER_topoWet) +
s(lon) +
s(lat),
family = binomial, data = d)
summary(mod)
##
## Family: binomial
## Link function: logit
##
## Formula:
## present ~ s(WC_alt) + s(WC_bio1) + s(WC_bio4) + s(WC_bio12) +
## s(ER_climaticMoistureIndex) + s(ER_tri) + s(ER_topoWet) +
## s(lon) + s(lat)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3946 0.5662 -2.463 0.0138 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(WC_alt) 3.780 4.476 8.753 0.058729 .
## s(WC_bio1) 5.456 6.579 36.937 < 2e-16 ***
## s(WC_bio4) 5.523 6.748 21.327 0.003969 **
## s(WC_bio12) 7.546 8.301 32.954 0.000126 ***
## s(ER_climaticMoistureIndex) 6.022 6.982 19.615 0.006694 **
## s(ER_tri) 1.526 1.842 10.125 0.007122 **
## s(ER_topoWet) 4.315 5.375 9.900 0.081225 .
## s(lon) 8.766 8.944 59.976 < 2e-16 ***
## s(lat) 6.803 7.803 26.436 0.000582 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.547 Deviance explained = 50.1%
## UBRE = -0.2428 Scale est. = 1 n = 1548
plot(mod, scale=0)